ABSTRACT

Bacteria from the genus Streptomyces are 
very important for the production of natural bio- 
active compounds such as antibiotic, antitumour 
or immunosuppressant drugs. Around two-thirds 
of all known natural antibiotics are produced by 
these bacteria. An enormous quantity of crucial 
data related to this genus has been generated 
and published, but so far no freely available 
and comprehensive database exists. Here, we 
present StreptomeDB
(http://www.pharmaceutical-bioinformatics.de/streptomedb/). To the best of
our knowledge, this is the largest database of 
natural products isolated from Streptomyces. It 
contains >2400 unique and diverse compounds 
from >1900 different Streptomyces strains 
and substrains. In addition to names and molecular 
structures of the compounds, information about 
source organisms, references, biological role, 
activities and synthesis routes (e.g. polyketide 
synthase derived and non-ribosomal peptides 
derived) is included. Data can be accessed 
through queries on compound names, chemical 
structures or organisms. Extraction from the litera- 
ture was performed through automatic text mining 
of thousands of articles from PubMed, followed by 
manual curation. All annotated compound struc- 
tures can be downloaded from the website and 
applied for in silico screenings for identifying new 
active molecules with undiscovered properties. 

INTRODUCTION 

Streptomyces, a well-studied genus of Gram-positive 
bacteria, belongs to the phylum Actinobacteria. These 
bacteria present a strikingly similar lifestyle to that of
filamentous fungi and, like those, most streptomycetes
live as saprophytes in the soil. They also successfully 
inhabit a wide range of other terrestrial and aquatic 
niches, and some strains are plant and animal pathogens 
(1). About 500 different species and thousands of strains 
and isolates are described in the literature (2,3), account- 
ing for an extremely diverse pool of secondary metabolites 
produced from several synthesis routes. In fact, almost 
half of all known natural products (NPs) are produced 
by 'actinomycetes' (mainly Streptomyces) (1,4). Even 
though these soil-dwelling organisms are better known 
as antibiotic producers — over two-thirds of the clinically 
useful antibiotics are isolated from Streptomyces (5) — the 
secondary metabolites have a wide bioactive and thera- 
peutic spectrum. Approved antitumour drugs such as the 
anthracycline antibiotic daunorubicin or the bleomycin 
complex, and autoimmune active agents such as the 
macrolide tacrolimus, among many others, are NPs exclu- 
sively produced by Streptomyces. Both novel and rare 
chemical scaffolds with therapeutically relevant activities 
have been discovered (6-8), like the unprecedented C-5 
spirocyclic fusion found in the antitumour fredericamycin 
A or the unique cyclohexa-1,2,4-triketone moiety
of fredericamycin E (9). Genetic manipulation of 
Streptomyces has been used to generate highly diverse 
chemical libraries by modification of synthesis routes 
(10-13). Altogether, these facts highlight the renewed 
interest from academia and the pharmaceutical industry 
in exploring NP libraries for compounds with novel scaf- 
folds showing therapeutic activity (14). 

Here, we present StreptomeDB, a database of com- 
pounds isolated from Streptomyces spp. The information 
included was collected from text mining and manual 
curation of thousands of abstracts and full papers using 
a newly developed in-house platform and two external 
databases. StreptomeDB contains data regarding the 
producing strains, the synthesized compounds, their bio- 
logical activity and the synthesis route, if available. It also 
features citations to scientific literature and the chemical 
structure and physico-chemical properties of the
compounds. To the best of our knowledge, it is the 
largest compilation of NPs produced by Streptomyces 
spp., including annotations on activities (e.g. antibiotic, 
antitumour or antifungal) and synthesis routes 
(e.g. polyketide synthase (PKS)-, non-ribosomal peptide 
synthase (NRPS)- or terpene-derived compounds). The 
database can be accessed by producer name, compound 
name, similarity and substructure chemical queries, biolo- 
gical activity and synthesis route annotation. 
Furthermore, it features a 'most common substructure
selection' (MCSS) panel containing the most frequently
occurring substructures within the available chemical
space, allowing for the fast and efficient selection of
compound families (e.g. β-lactams and tetracyclines).
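
Similarity queries of the kind mentioned above are commonly implemented by comparing molecular fingerprints. The following is a minimal sketch, assuming that fingerprints are represented as sets of 'on' bit indices and that the Tanimoto coefficient is the similarity measure. The source does not state which fingerprint or measure StreptomeDB uses, and the compound names and bit sets below are invented for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / union


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    threshold: float = 0.7,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
LIBRARY = {
    "compound_a": frozenset({1, 2, 3, 4}),
    "compound_b": frozenset({1, 2, 3, 9}),
    "compound_c": frozenset({7, 8}),
}
QUERY = frozenset({1, 2, 3, 4})

HITS = similarity_search(QUERY, LIBRARY, threshold=0.5)
```

With the invented data, `compound_a` and `compound_b` pass the 0.5 threshold and `compound_c` does not. The threshold is a tunable assumption, not a value taken from the database.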

StreptomeDB brings a unique tool to researchers in 
both academia and the pharmaceutical industry for the 
study of secondary metabolites and the discovery of thera- 
peutically relevant novel compounds from natural 
sources. To facilitate the use of the compounds in 
in silico screenings for the identification of new active 
molecules, all structures including their annotations can 
be downloaded from the website as a structure data file. 
The database is freely accessible at
http://www.pharmaceutical-bioinformatics.de/streptomedb.

DATA AND METHODS 

Extraction of information in abstracts 

All articles available in PubMed were searched for the
term 'streptomyces' in medical subject heading (MeSH)
terms, keywords, titles and abstracts. For the resulting
articles, the abstracts were screened for potential
compound names using the CIL database (15), yielding
around 15 600 abstracts which potentially contained
information on compounds produced by Streptomyces spp. A
team of seven experts in the field of streptomycetes, their
products and the mode of action of antibiotics, with
backgrounds in biology, chemistry, bioinformatics and
pharmaceutical sciences, read and annotated over 8400
abstracts (including all abstracts of the last 3 years)
with an in-house software module, consulting full texts
where needed. Texts were searched for the following types
of entities: compounds, producing organisms, activities of
the compound and the synthesis pathways. The latter were
defined as parts of, or gene clusters for, certain pathways
specific for the synthesis of secondary metabolites such as
antibiotics. This included terpene, shikimate, ribosomal
peptide synthetase (RPS), NRPS and PKS pathways.
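
The screening step described above, in which abstracts are checked against a list of known compound names, can be sketched as a dictionary lookup. This is a minimal illustration and not the authors' pipeline: the name list stands in for the CIL database, the abstract identifiers and texts are invented, and case-insensitive whole-word matching is an assumption.

```python
import re
from typing import Dict, Iterable, Set


def build_name_pattern(names: Iterable[str]) -> re.Pattern:
    """Compile one regex that matches any of the given names as whole words."""
    # Longest names first, so a long name wins over a shorter name it contains.
    ordered = sorted(set(names), key=len, reverse=True)
    if not ordered:
        raise ValueError("at least one compound name is required")
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def screen_abstracts(
    abstracts: Dict[str, str], names: Iterable[str]
) -> Dict[str, Set[str]]:
    """Map each abstract id to the compound names found in its text."""
    pattern = build_name_pattern(names)
    hits: Dict[str, Set[str]] = {}
    for abstract_id, text in abstracts.items():
        found = {match.group(0).lower() for match in pattern.finditer(text)}
        if found:
            hits[abstract_id] = found
    return hits


# Hypothetical inputs for illustration only.
NAMES = ["daunorubicin", "tacrolimus"]
ABSTRACTS = {
    "id1": "A Streptomyces isolate was found to produce Daunorubicin.",
    "id2": "A soil survey of Streptomyces isolates.",
}

CANDIDATES = screen_abstracts(ABSTRACTS, NAMES)
```

Only abstracts with at least one dictionary hit are kept as candidates; in the workflow described above, such candidates would then go to manual curation.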

At the outset, identical test sets of 10 abstracts were
used in three rounds to compare and align the curation
practice and reliability of the different curators, with
subsequent refinements of entity definitions, resulting
in fixed and mandatory guidelines for curation. Unique
identifiers were assigned to the terms of the annotated 
entities. For most compounds, structural descriptions 
were inherited from the PubChem database if available 
or drawn. Organism names were unified and organized 
in a 'main organism' and 'strains/mutant' hierarchy. 
Activities and synthesis routes were stored with the
annotated text parts and additionally classified by
keywords.
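
The 'main organism' and 'strains/mutant' hierarchy described above can be illustrated with a small normalisation routine. This is a sketch under a simplifying assumption that is not taken from the source: the first two tokens of a name (genus and species epithet, or 'sp.') form the main organism and any remaining tokens form the strain designation. The organism names used are examples only.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def split_organism(name: str) -> Tuple[str, Optional[str]]:
    """Split an organism name into (main organism, strain or None)."""
    tokens = name.split()
    if len(tokens) < 2:
        return (tokens[0].capitalize() if tokens else "", None)
    # Unify capitalisation: genus capitalised, epithet in lower case.
    main = f"{tokens[0].capitalize()} {tokens[1].lower()}"
    strain = " ".join(tokens[2:]) or None
    return main, strain


def build_hierarchy(names: Iterable[str]) -> Dict[str, Set[str]]:
    """Group strain designations under their unified main organism."""
    tree: Dict[str, Set[str]] = {}
    for name in names:
        main, strain = split_organism(name)
        strains = tree.setdefault(main, set())
        if strain:
            strains.add(strain)
    return tree


HIERARCHY = build_hierarchy(
    [
        "Streptomyces coelicolor A3(2)",
        "streptomyces Coelicolor",
        "Streptomyces sp. XY-1",  # hypothetical strain label
    ]
)
```

Real organism names are less regular than this (subspecies, mutants, culture collection numbers), which is one reason the text above reports names that could not be assigned unique identifiers.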

Curation work yielded around 5700 abstracts and full 
texts containing information on Streptomyces spp. 
producing one or more compounds and describing 
compound activities or synthesis routes. All remaining 
articles contained no such information in the abstracts 
or available full texts. In 2250 annotated abstracts, 
compound and organism names did not allow for the as- 
signment of unique identifiers. Thus, they are subject to 
on-going curation. 

Inclusion of data from existing data sources 

To complete, enlarge and confirm the body of available
information in StreptomeDB, existing data sources with
new and overlapping data were used: the thesaurus of the
MeSH, the KNApSAcK database (16) and the Novel
Antibiotics DataBase (NADB)
(http://www0.nih.go.jp/~jun/NADB/search.html), containing
substances first reported in the Journal of Antibiotics
(http://www.antibiotics.or.jp/journal/ja-top.htm).
Descriptions of MeSH were queried for compounds with
descriptions specifying an organism which could be
uniquely identified. In MeSH descriptions, 83
compound-organism relationships could be found and unique
identifiers for compounds and organisms were assigned.
Data from the NADB are available through its 'Namazu: a
full text search engine' interface
(http://www0.nih.go.jp/~jun/NADB/namazu.cgi?query=streptomyces,
accessed 25 July 2012). It provided 2557 abstracts, for
1225 of which unique identifiers could be assigned to
compounds and organisms. The KNApSAcK database contains
1988 metabolite-Streptomyces spp. relationships with one
or more associated literature references
(http://kanaya.naist.jp/knapsack_jsp/result.jsp?sname=organism&word=streptomyces,
accessed 25 July 2012). For 1245 of those
compound-organism references, unique identifiers for
compounds and organisms could be assigned.
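
Once unique identifiers are assigned, combining overlapping sources reduces to merging sets of compound-organism pairs while recording where each pair came from. The sketch below assumes a representation of (compound identifier, organism identifier) tuples and uses invented identifiers; it illustrates the idea rather than the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Relationship = Tuple[str, str]  # (compound_id, organism_id), hypothetical ids


def merge_sources(
    sources: Dict[str, Iterable[Relationship]]
) -> Dict[Relationship, Set[str]]:
    """Merge relationships from several sources, keeping their provenance."""
    merged: Dict[Relationship, Set[str]] = defaultdict(set)
    for source_name, relationships in sources.items():
        for relationship in relationships:
            merged[relationship].add(source_name)
    return dict(merged)


def confirmed(
    merged: Dict[Relationship, Set[str]], min_sources: int = 2
) -> Set[Relationship]:
    """Relationships reported independently by at least min_sources sources."""
    return {rel for rel, names in merged.items() if len(names) >= min_sources}


# Invented example data; source names follow the text above.
SOURCES = {
    "curated_abstracts": [("C1", "O1"), ("C2", "O1")],
    "KNApSAcK": [("C1", "O1"), ("C3", "O2")],
    "NADB": [("C3", "O2"), ("C4", "O3")],
}

MERGED = merge_sources(SOURCES)
CONFIRMED = confirmed(MERGED)
```

Pairs seen in only one source enlarge the dataset, while pairs seen in several sources confirm each other, matching the two purposes named in the paragraph above.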

Failure of assignment of unique identifiers 

Failure of assignment of unique identifiers to compound 
and organism names in the three existing data sources and 
the curated abstracts was caused by ambiguous names, 
spelling and curation errors which could not be traced 
back, or compound names and synonyms which were 
not available in PubChem or any other freely accessible 
database. 

Searching with most common substructures 

To enable the user to search with common substructures, 
all compounds were fragmented using the RECAP 
algorithm (17). The 120 most frequent fragments contain- 
ing a cyclic structure occurring as substructures in all 
StreptomeDB compounds are presented to the user for 
selection. 
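
The ranking step described above can be sketched independently of the fragmentation itself. The example assumes that each compound has already been fragmented (for instance by a RECAP implementation) into canonical SMILES strings, so that identical fragments compare equal as text, and that each fragment is counted once per compound. The compound identifiers and fragment lists are invented.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

_BRACKET_ATOM = re.compile(r"\[[^\]]*\]")


def has_ring(fragment_smiles: str) -> bool:
    """True if the SMILES contains a ring-closure label.

    Outside bracket atoms, digits (and '%') in SMILES only denote ring
    closures, so bracket atoms are stripped before checking.
    """
    stripped = _BRACKET_ATOM.sub("", fragment_smiles)
    return any(ch.isdigit() or ch == "%" for ch in stripped)


def most_common_cyclic_fragments(
    fragments_by_compound: Dict[str, Iterable[str]], top_n: int = 120
) -> List[Tuple[str, int]]:
    """Rank cyclic fragments by the number of compounds containing them."""
    counts: Counter = Counter()
    for fragments in fragments_by_compound.values():
        # A set ensures each fragment is counted once per compound.
        counts.update({f for f in fragments if has_ring(f)})
    return counts.most_common(top_n)


# Invented fragment lists; '[*]' marks an attachment point.
FRAGMENTS = {
    "cmpd1": ["c1ccccc1", "[*]CC", "c1ccccc1"],
    "cmpd2": ["c1ccccc1", "C1CN1"],
    "cmpd3": ["[13CH3][*]"],
}

TOP = most_common_cyclic_fragments(FRAGMENTS, top_n=2)
```

The default of 120 mirrors the number of fragments shown in the panel; everything else in the sketch is an assumption.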

Data were stored in a PostgreSQL database 
(PostgreSQL 8.4.8, PostgreSQL Global Development 
Group). All calculations of chemical properties were 
executed using Open Babel 2.3.0 (18). 
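
To make the storage model concrete, the following sketch shows one way compounds, organisms and their relationships could be laid out relationally. It is not the StreptomeDB schema: all table and column names are hypothetical, and SQLite from the Python standard library is used as a stand-in for PostgreSQL so that the example is self-contained.

```python
import sqlite3
from typing import List

# Hypothetical schema: a many-to-many link between compounds and organisms,
# with strains pointing to their main organism through parent_id.
SCHEMA = """
CREATE TABLE compound (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    mol_weight REAL
);
CREATE TABLE organism (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES organism(id)
);
CREATE TABLE compound_organism (
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    organism_id INTEGER NOT NULL REFERENCES organism(id),
    reference TEXT,
    PRIMARY KEY (compound_id, organism_id)
);
"""


def compounds_for_organism(conn: sqlite3.Connection, name: str) -> List[str]:
    """Compounds produced by an organism or by any of its direct strains."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.name
        FROM compound AS c
        JOIN compound_organism AS co ON co.compound_id = c.id
        JOIN organism AS o ON o.id = co.organism_id
        LEFT JOIN organism AS p ON p.id = o.parent_id
        WHERE o.name = ? OR p.name = ?
        ORDER BY c.name
        """,
        (name, name),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Invented rows for illustration.
conn.executemany(
    "INSERT INTO organism VALUES (?, ?, ?)",
    [(1, "Streptomyces example", None), (2, "Streptomyces example X1", 1)],
)
conn.executemany(
    "INSERT INTO compound VALUES (?, ?, ?)",
    [(1, "compound_a", 300.0), (2, "compound_b", 450.5)],
)
conn.executemany(
    "INSERT INTO compound_organism VALUES (?, ?, ?)",
    [(1, 1, "ref1"), (2, 2, "ref2")],
)
```

Querying by the main organism returns compounds from its strains as well, whereas querying by a strain name returns only that strain's compounds.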
RESULTS

Compound diversity 

The database contains >2400 different secondary metab- 
olites produced by Streptomyces spp. (Table 1). Diverse 
compound classes such as anthracyclines and cyclic 
peptides are included (Figure 1). The large proportion of 
very complex compounds results in a much higher average 
molecular weight (median: 453 g/mol) compared with
typical drugs (median: 310 g/mol, 19). For example, the
approved antibiotic drug actinomycin has a molecular
weight of 1255 g/mol.
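
Table 1 also reports how many compounds fit Lipinski's Rule of Five, which molecular weight alone does not decide. A check against the four standard criteria can be sketched as follows, assuming the descriptors have already been computed by a cheminformatics toolkit. Whether any violation was tolerated in the count of Table 1 is not stated in the source, so the tolerance is left as a parameter, and all descriptor values other than the molecular weights quoted above are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties of one compound (values supplied by the caller)."""
    mol_weight: float      # g/mol
    log_p: float           # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five criteria are violated."""
    return sum(
        [
            d.mol_weight > 500,
            d.log_p > 5,
            d.h_bond_donors > 5,
            d.h_bond_acceptors > 10,
        ]
    )


def fits_rule_of_five(d: Descriptors, max_violations: int = 0) -> bool:
    """True if the compound has at most max_violations violations."""
    return lipinski_violations(d) <= max_violations


# Only the molecular weights come from the text; other values are hypothetical.
TYPICAL_DRUG = Descriptors(mol_weight=310.0, log_p=2.1, h_bond_donors=2, h_bond_acceptors=5)
LARGE_NP = Descriptors(mol_weight=1255.0, log_p=1.0, h_bond_donors=5, h_bond_acceptors=18)
```

With these illustrative values, the small molecule passes and the large natural product fails on molecular weight and acceptor count.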

The MCSS panel for compound selection (Figure 2) is 
based on substructures that can be synthetically combined 
and are common in drug-like molecules. The panel high- 
lights the large diversity of the rings present in the NPs of 
StreptomeDB (Figure 1). For example, 60 β-lactam antibiotics
and 50 antitumour anthracyclines are included in
the database (20). The MCSS panel allows a direct selec- 
tion and identification of compounds containing such 
substructures. 

Activities and synthesis routes 

Many pharmaceutically relevant peptides are synthesized
by NRPSs (21). Such compounds are characterized by an 
extremely broad range of biological activities, pharmaco- 
logical properties and rare structural features. NRPS- 
derived compounds annotated in StreptomeDB include 
cytostatic agents such as epothilone or bleomycin, or anti- 
biotics such as daptomycin or enduracidin. For other 
compounds of that class, the broad range of activities 
has not been clarified yet and remains subject to further 
research (22). 

The 217 compounds annotated as PKS-derived are the
largest group in StreptomeDB. Analogous to NRPS-derived
compounds, they are an important source of
pharmacologically relevant molecules (23). Many of the
therapeutically used antibiotics, such as tetracyclines, are
produced by PKSs. Examples of important polyketides
included in StreptomeDB with annotated activity are 
the antibacterial oxytetracycline and the anticancer 
geldanamycin (24). Additional synthesis routes include 
terpene synthesis or RPS. 

For 875 NPs, activity information is included in the 
database. The annotations contain very specific descrip- 
tions (e.g. inhibitor of protein X) as well as more unspe- 
cific classifications (e.g. antibiotic). In total, 71 different 
activity classifications have been included in the database. 
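
Mapping free-text annotations of differing specificity onto a fixed set of activity classes can be done with simple keyword matching. The sketch below illustrates that idea only: the four classes and their keyword lists are hypothetical and do not reproduce the 71 classifications used in the database.

```python
from typing import Dict, List, Set

# Hypothetical classes and keywords, for illustration only.
KEYWORDS: Dict[str, List[str]] = {
    "antibiotic": ["antibiotic", "antibacterial", "antimicrobial"],
    "antitumour": ["antitumour", "antitumor", "anticancer", "cytostatic"],
    "antifungal": ["antifungal"],
    "enzyme inhibitor": ["inhibitor of", "inhibits"],
}


def classify_activity(
    annotation: str, keywords: Dict[str, List[str]] = KEYWORDS
) -> Set[str]:
    """Return every class with at least one keyword in the annotated text."""
    text = annotation.lower()
    return {
        activity_class
        for activity_class, words in keywords.items()
        if any(word in text for word in words)
    }


SPECIFIC = classify_activity("Potent inhibitor of protein X with antibacterial activity")
UNSPECIFIC = classify_activity("antibiotic")
```

Keeping the annotated text alongside the derived classes, as described in the methods, preserves the specific description when the classification itself is coarse.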

Case studies 

The database enables fast delivery of information related
to research in pharmaceutical sciences or chemistry.
Some examples of possible applications are explained in
the following.

Drug discovery 

Many NPs have therapeutically relevant activities (25). To 
start a general search for bioactive compounds, it is useful
to look more closely at substructures that are known to be
active and that appear in several drugs (26). For example,
phenazine compounds possess activities on several target 
proteins because of their ability to promote electron 
transfer (27). Particularly, the effect on G-protein- 
coupled receptors (GPCRs) is demonstrated by several 
phenazine-containing drugs (28). 

StreptomeDB contains 26 phenazines. Structures can be 
easily selected through the MCSS panel (Figure 2). 
Assaying the activities of NPs on GPCRs or other 
known drug targets may reveal some new specific
therapeutic functions. Recently, Ohlendorf et al. (29)
showed that the phenazine geranylphenazinediol extracted
from a marine Streptomyces sp. is a potent inhibitor of
human acetylcholinesterase.

Search for enzymes able to catalyse specific 
synthesis steps 

Aziridine-containing NPs are extremely rare (30). 
However, the DNA alkylating and crosslinking activities 
of aziridine analogues have been shown to be attractive 
for the development of anti-leukaemia therapeutics (31). 
The knowledge about the organisms, gene clusters and 
enzymes that can synthesize such ring systems may 
enable the design of chemotherapeutic agents with 
enhanced stability and tumour selectivity which can be 
produced by genetically modified enzymes (30). 

A substructure search in StreptomeDB starting with 
aziridine results in 13 compounds containing this func- 
tional group (Figure 3). The related links lead to the 
producing organisms and to the source publications. 
Specifically, Zhao et al. (32) describe a PKS-gene cluster 
which is involved in the pathway for the synthesis of 
azinomycin B. This is a very good starting point for the 
identification of associated genes and the investigation of 
the enzymatic mechanism responsible for building import- 
ant groups such as aziridine rings. 

Search for antibiotics effective against newly 
resistant bacteria 

Overtreatment with antibiotics has led to bacterial strains
resistant to several known antibiotics (33). Yet antibiotics
that are no longer in use have been successfully
administered in the treatment of infections caused
by resistant bacteria (34).

In StreptomeDB, many antibiotics are described which 
were never or rarely used as clinical drugs. An activity 



Table 1. Content of StreptomeDB (accessed 9 October 2012)

Content                                             Value
Compounds (No.)                                     2444
Annotated PKS-synthesized compounds (No.)           256
Molecular weight [g/mol] (median)                   452.5
Annotated NRPS-synthesized compounds (No.)          51
Compounds that fit Lipinski's Rule of Five (No.)    1522
Different organisms, including strains (No.)        1985
Compounds annotated with activity (No.)             1036
Referenced articles (No.)                           4544
Compounds annotated with structure (No.)            2444
Number of compound-organism relationships (No.)     4341
Figure 1. Examples of NPs produced by Streptomyces spp. included in StreptomeDB. (a) Enduracidin A, NRPS-derived antibiotic; (b) daunorubicin,
antitumour anthracycline; (c) oxytetracycline, tetracycline antibiotic; (d) bleomycin, NRPS/PKS-derived anticancer; (e) geldanamycin, macro-cyclic
Hsp90 inhibitor; (f) epothilone B, anticancer macro-cycle; (g) fredericamycin A, C-5 spirocyclic DNA-polymerase inhibitor; (h) chloramphenicol,
broad-spectrum antibiotic; (i) tacrolimus, macro-cyclic immunosuppressive; (j) fosfomycin, broad-spectrum antibiotic; (k) geranylphenazinediol,
acetylcholinesterase inhibitor; (l) actinomycin, NRPS/PKS-derived antineoplastic agent and (m) daptomycin, NRPS-derived antibiotic.



search in the database starting with 'antibiotic' results in a 
list of described antibiotics with the related publications. 
Antibiotics described long ago such as fosfomycin or 
chloramphenicol may be still active against highly resist- 
ant strains when administered either on their own or in 
combination with other antibiotics (35,36). 

DISCUSSION AND FUTURE PROSPECTS 

Different statements exist about the total number of NPs 
which are synthesized by Streptomyces spp. Some years 
ago, Berdy (4) reported a number of 3000 bioactivities 
but the referenced source database is no longer available 
and it remains unclear which compounds are meant. 
KNApSAcK and NADB databases list ~1500 compounds 
which are synthesized by Streptomyces. StreptomeDB has 
currently annotated around 2400 unique compounds but 
the number is continuously increasing. One reason for not 
including all compounds that are described is that import- 
ant information is included in the full texts but is not ne- 
cessarily in the abstracts of articles. The detailed analysis 
of related full texts is subject to an on-going project which 
will further increase the dataset not only for compounds 
but also for activity data and synthesis routes. Newly
published articles are curated and included in the
database on a three-monthly basis.

Furthermore, several hundred compounds
annotated in StreptomeDB were not included in the
PubChem database and therefore could not easily be
assigned structures. We have started to draw many
of those compounds manually and intend to assign
chemical structures to all of them.

Researchers are increasingly interested in genomic data
encoding the enzymes that synthesize NPs.
Knowledge about the enzymatic mechanisms of important
synthesis steps enables the production of chemicals
in engineered organisms (10-13).
Databases describing genes and gene clusters encoding
important enzymes involved in pathways for the
production of NPs exist [e.g. NORINE (22) and NRPS-PKS
(37)] but a lot of important data are still hidden in
publications as text information. StreptomeDB complements
existing datasets and supports data collection projects
dealing with biological chemistry by allowing recognition
of organisms containing enzymes which are able to
catalyse the formation of important functional groups and
specific synthesis steps. More detailed information about
involved pathways and related genes responsible for compound
Figure 2. MCSS panel in StreptomeDB, featuring the most common cyclic structures included in the database.
Figure 3. Workflow of a substructure search for aziridine-containing compounds in StreptomeDB.



synthesis would be of interest for the research community. 
Even though most of this information is not yet described 
in the literature, the inclusion of the available data will be 
part of StreptomeDB in future updates. 



We provide the complete dataset for download 
including the structural information. This opens the pos- 
sibility for modelling molecular interactions on the struc- 
tural level. In silico screening approaches for the 
identification of new drugs are becoming more and more
important (38-40). We believe that the presented dataset is
a valuable molecular library for the virtual screening of
therapeutically important target proteins. For many NPs,
the complete activity spectrum is not clarified yet. The
supplied data will help identify NPs or analogues
useful as candidates for new active compounds.
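
As a starting point for such screening workflows, the annotations in a downloaded structure data file can be read with a few lines of code. The sketch below relies only on the general SD file layout (a mol block ending in 'M  END', data items introduced by '> <field>' lines, records terminated by '$$$$'). The field names 'name' and 'activity' and the placeholder records are hypothetical, since the fields of the actual download are not described in the source.

```python
import re
from typing import Dict, Iterator, List

_FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def iter_sdf_records(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per SD record: data fields plus the raw 'molblock'."""
    record: Dict[str, str] = {}
    molblock: List[str] = []
    field = None
    in_molblock = True
    for line in text.splitlines():
        if line.strip() == "$$$$":  # end of record
            record["molblock"] = "\n".join(molblock)
            yield record
            record, molblock, field, in_molblock = {}, [], None, True
        elif in_molblock:
            molblock.append(line)
            if line.startswith("M  END"):
                in_molblock = False
        else:
            header = _FIELD_HEADER.match(line)
            if header:
                field = header.group(1)
                record[field] = ""
            elif field is not None:
                if not line.strip():
                    field = None  # a blank line closes the data item
                elif record[field]:
                    record[field] += "\n" + line
                else:
                    record[field] = line


def _placeholder_record(name: str, activity: str) -> List[str]:
    """Build an SD record with an empty atom list (illustration only)."""
    return [
        name, "  sketch", "",
        "  0  0  0  0  0  0  0  0  0  0999 V2000",
        "M  END",
        "> <name>", name, "",
        "> <activity>", activity, "",
        "$$$$",
    ]


SAMPLE = "\n".join(
    _placeholder_record("compound_a", "antibiotic")
    + _placeholder_record("compound_b", "antitumour")
)
RECORDS = list(iter_sdf_records(SAMPLE))
ANTIBIOTICS = [r["name"] for r in RECORDS if r.get("activity") == "antibiotic"]
```

The raw mol block is kept verbatim so that it can be handed to a cheminformatics toolkit for descriptor calculation or docking preparation.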



CONCLUSION 

To the best of our knowledge, StreptomeDB is the largest 
database describing NPs produced by streptomycetes. 
Downloadable chemical structures allow for the applica- 
tion in virtual screening. Collected data will support 
analyses of gene clusters and associated enzymes respon- 
sible for the synthesis of functional groups. Streptomyces 
is the most important genus for the production of thera- 
peutic NPs. Thus, the database will be of interest for re- 
searchers working in the area of drug discovery and 
chemistry. 